feat(speedwagon): add DescriptionAgent for KB-level description generation #47
…ments
- Drop the "alongside other KBs" framing from the prompt opening so the model isn't primed to emit comparison vocabulary; the existing "do not compare" clause now matches the framing.
- Note Korean output's ~1/3 char density at the same budget; per-language budgets are deferred (LLM can't count Korean words reliably either).
- Trim verbose doc-comments across description.rs and Store::describe; add cost note (~24K input chars at N=200) to discourage synchronous use on indexing hot paths.
Verified against the 4-KB / 12-probe harness: routing accuracy stays 12/12 across the shipped baseline, the prompt-only change, and word-budget variants.
cargo test -p speedwagon --lib description: 12 passed, 1 ignored.
(Apologies for raising something unrelated; I found this while running the tests for this PR.) @jhlee525 This is irregular, so we may need to fix it. How should we change it?
It appears that for stores containing a Korean document set, the Since the Just as an LLM thinks in English even while responding in Korean, if using English I have no particular preference regarding proposals to fix the language; however, if you are interested in this, I would like to add a bit more justification for it. (Deciding after verifying this accurately through benchmarks could be one approach. However, since the characteristics of the currently available datasets are relatively clear, I suspect LLMs are unlikely to be confused by them, and I believe the decision is too small in scope to justify that effort, so I wouldn't recommend it.)
I agree with this direction. For AgentCard descriptions, English may also be more token-efficient in many cases, especially compared to Korean descriptions, while improving consistency and readability.
Switch `&[(String, String)]` / `&[String]` to `&[(&str, &str)]` / `&[&str]` in `generate`, `get_description`, `build_user_message`, and `fallback_description`, and drop the upfront title/purpose clone in `Store::describe`. Caller-side `Document` strings are borrowed directly, and the fallback title vec is built only on the empty-LLM-response branch.
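A minimal sketch of the borrowing pattern this commit describes, assuming a simplified `Document` with only `title` and `purpose` fields. The helper `borrow_pairs` and the message layout inside `build_user_message` are hypothetical stand-ins, not the crate's actual code; only the `&[(&str, &str)]` parameter shape comes from the commit message.

```rust
// Hypothetical stand-in for the crate's `Document`; only the two fields
// the description path reads are modeled here.
struct Document {
    title: String,
    purpose: String,
}

// Borrow (title, purpose) pairs straight from the documents.
// No `String` is cloned; the returned slices point into `docs`.
fn borrow_pairs(docs: &[Document]) -> Vec<(&str, &str)> {
    docs.iter()
        .map(|d| (d.title.as_str(), d.purpose.as_str()))
        .collect()
}

// Consumer using the borrowed parameter shape named in the commit message.
// The message layout below is an assumption made for illustration.
fn build_user_message(kb_name: &str, docs: &[(&str, &str)]) -> String {
    let mut msg = format!("KB: {}\n", kb_name);
    for (_title, purpose) in docs {
        msg.push_str("- ");
        msg.push_str(purpose);
        msg.push('\n');
    }
    msg
}
```

Because the pairs only borrow from the caller's documents, the caller keeps ownership and the description path allocates nothing per document beyond the vector of slice pairs.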
grf53
left a comment
In addition to the notes above, we could consider feeding the LLM the existing description together with data from the modified documents, to avoid repeatedly sending it nearly identical input when describing.
However, this approach can skew the output too far toward the changes, so it needs observation and a policy set at a higher level; it is better considered later, when improvements are needed.
LGTM
Recency bias is actually the main reason for the current policy. Purpose is capped at 200 chars (~120 avg), so
* refactor: rename knowledge-agent crate to speedwagon and restructure modules Renames the crate and reorganizes source into store (indexer, parser, searcher, translator) and tool modules. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * add functions * update tools * add global tool function * add cli & tests * add get_many & ingest_many functions * update internal features * apply latest ailoy, initialize backend v2 * update ailoy the one ToolFactory is Send + Sync * find: line-based matching with bare-word fallback and query syntax (#40) Match line-by-line internally and report byte offsets externally. Bare-word queries get progressive AND -> HALF -> OR fallback; the level used is surfaced in the response so callers can tell strict matches from relaxed ones. Replace the single case-insensitive regex query with a small structured-query parser supporting "phrase", +term/-term, AND/OR/NOT, (group), and /regex/. Any explicit operator opts out of fallback so caller intent is preserved. Co-authored-by: nuri-yoo <nuri-yoo@users.noreply.github.com> * Add purpose metadata field to Store ingest (#41) Add LLM-generated `purpose` as a third indexed field alongside `title` and `content`. The prompt and preview strategy match the v1 variant from the metadata ablation report (first 3000 chars, "what queries should find this document?" framing), which improved BM25 hit@5 by +15.3pp on FinanceBench over the title-only baseline. - `Document` gains a `purpose: String` field. - `parser::get_purpose()` calls a new `PurposeAgent` (gpt-5.4-mini, JSON response parsed via `serde_json`). Empty/invalid responses fall back to an empty string with a warn log so ingest still succeeds; transport-level errors propagate. - Tantivy schema adds `purpose` as `TEXT|STORED`. `open_or_create` detects schema mismatch on existing indexes and rebuilds the index directory automatically (corpus is preserved, so re-ingest only pays the LLM cost). 
- `Store::ingest` and `Store::ingest_many` invoke `get_title` and `get_purpose` in parallel via `tokio::try_join!`. - `QueryParser` default fields are extended to `[title, purpose, content]` so purpose terms participate in BM25 scoring. Closes #37 Co-authored-by: nuri-yoo <nuri-yoo@users.noreply.github.com> * add basic apis to test sandbox and messaging * add keep-alive on SSE, factor out build_agent * add GET/DELETE messages API * use latest of ailoy PR#391 * update ailoy * Add HTML support for Store (#46) * feat: add HTML file type support for `Store` * refactor(translator): split translator into files based on FileType * refactor(speedwagon): centralize filetype mapping and translator dispatch * feat(speedwagon-cli): apply shared filetype mapping to ingest * fix: apply chrome-stripping universally instead of site-specific branch * fix: use single-quoted string to avoid complex escapes * fix: adjust condition to prepend title and * fix: add more chrome types * refactor: use `html_to_markdown_rs`, not `dom_smoothie` and `dom_query` * feat(speedwagon): add DescriptionAgent for KB-level description generation (#47) * feat(speedwagon): add DescriptionAgent for KB-level description generation * refactor(speedwagon): self-anchor description prompt and trim doc-comments - Drop the "alongside other KBs" framing from the prompt opening so the model isn't primed to emit comparison vocabulary; the existing "do not compare" clause now matches the framing. - Note Korean output's ~1/3 char density at the same budget; per-language budgets are deferred (LLM can't count Korean words reliably either). - Trim verbose doc-comments across description.rs and Store::describe; add cost note (~24K input chars at N=200) to discourage synchronous use on indexing hot paths. Verified against the 4-KB / 12-probe harness: routing accuracy stays 12/12 across the shipped baseline, the prompt-only change, and word-budget variants. cargo test -p speedwagon --lib description: 12 passed, 1 ignored. 
* feat(speedwagon): force description output to English * refactor(speedwagon): borrow doc slices in description path Switch `&[(String, String)]` / `&[String]` to `&[(&str, &str)]` / `&[&str]` in `generate`, `get_description`, `build_user_message`, and `fallback_description`, and drop the upfront title/purpose clone in `Store::describe`. Caller-side `Document` strings are borrowed directly, and the fallback title vec is built only on the empty-LLM-response branch. --------- Co-authored-by: nuri-yoo <nuri-yoo@users.noreply.github.com> * Integrating helper agents into a common interface (#50) * refactor(speedwagon): use ailoy default_provider for LLM helpers * refactor(speedwagon): consolidate LLM helpers under HelperAgent trait * refactor(speedwagon): tighten HelperAgent contract and response handling * refactor(speedwagon): pick helper model from preference list, fall back when none registered * Test speedwagon tool (#48) * UPDATE : create session with speedwagon * ADD : Session message & e2e rough test Co-authored-by: Copilot <copilot@github.com> * chore : test enhance & making agent change graceful Co-authored-by: Copilot <copilot@github.com> * UPDATE : instructions of swcard/main-agent * ADD : commented build_agent with direct speedwagon tool Co-authored-by: Copilot <copilot@github.com> * remove : duplicated dep in dev * update ailoy * align with latest ailoy develop * refactor: remove dead code and deduplicate e2e test helpers - Remove unused `into_runtime` and `into_runtime_with_provider` from SpeedwagonSpec (never called anywhere) - Migrate e2e_test.rs to use shared `common` module helpers instead of local `json_request` and `extract_assistant_text` duplicates - Fix `.sandbox()` → `.runenv()` in commented-out alternative build_agent Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Copilot <copilot@github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: khj809 <onsealeatang@gmail.com> * chore(deps): 
bump ailoy to post-#391 (AgentBuilder + sandbox sharing) (#51) Bumps ailoy from c42231e1 (2026-04-21) to 098a8289 (PR #391, 2026-05-04). This pulls in 27 commits worth of breaking changes; this PR only patches speedwagon to compile and pass tests against the new API. AgentBuilder itself is not adopted yet — that's PR #52. Breaking changes absorbed - ailoy#389: ToolFactory introduced; tool sources now live on ToolProvider.custom(ToolFactory::simple(desc, func)) rather than ToolSet.insert(name, desc, func). - ailoy#390: RunEnv trait + sandbox feature split. AgentState carries an Arc<dyn RunEnv>, defaulting to Local. No direct touch in this PR; speedwagon never constructed sandboxes. - ailoy#391: Provider registry overhaul. AgentProvider.models is now a LangModelProvider with .insert(pattern, LangModelProviderElem::API{..}) / .get(name) glob-matching. The old default_provider_mut().model_openai() / model_claude() / model_gemini() helper constructors were removed, along with provider.get_model(...). Agent::try_with_tools(spec, provider, toolset) is gone — tools must live on provider.tools. Speedwagon changes - tool/mod.rs: build_toolset(store) -> ToolSet renamed to build_tool_provider(store) -> ToolProvider, using ToolFactory::simple. - main.rs::build_agent: no separate toolset arg; clones the global provider, overrides provider.tools with the store-bound tools, Agent::try_with_provider(spec, &provider). - store/helper.rs:88: provider.get_model(m) -> provider.models.get(m). Same semantics, new accessor location. - New module speedwagon::provider with register_provider_from_env that reads OPENAI_API_KEY / ANTHROPIC_API_KEY / GEMINI_API_KEY and registers the same glob patterns the removed ailoy helpers used (openai/*, anthropic/claude-*, google/gemini-*). main.rs and the description.rs integration test both call this — keeps the env-key → glob-pattern mapping in one place, preserving PR #49's invariant that helper modules never read env directly. 
Out of scope - chat-agent / backend / knowledge-agent: these were already broken on the refactoring-applied baseline (workspace member vs. standalone ambiguity). They will be brought back in a separate PR alongside the AgentBuilder migration (planned PR #52). Verification cargo check -p speedwagon --tests --all-features # clean cargo test -p speedwagon --lib --all-features # 71 passed; 0 failed; 2 ignored Co-authored-by: nuri-yoo <nuri-yoo@users.noreply.github.com> * feat: add auth/user APIs (#54) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * ADD: batch document ingest + bulk purge with partial success (#57) * feat: batch document ingest + bulk purge with partial success - Add `Store::ingest_many` partial success (IngestResult/IngestFailure) with batch index optimization and best-effort cleanup on failure - Add `Store::purge_many` (PurgeResult/PurgeFailure) - POST /documents: multi-file multipart upload with per-file validation - DELETE /documents: bulk purge via JSON body { ids: [...] } - GET /documents/{id}: single document retrieval - Response DTOs: BatchIngestResponse, BatchPurgeResponse, FailedItem - 14 new document tests + e2e test rewritten for multi-doc HTTP flow Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Preserve batch-ingest failure evidence while reducing helper duplication Keep the document batch API behavior unchanged while removing a silent fallback in failure indexing and trimming duplicated test HTTP setup. 
Constraint: Cleanup is scoped to PR #57 / a69222c document-batch changes only.\nRejected: Rewriting broader Speedwagon parser/tool clippy warnings | outside the requested commit scope.\nConfidence: high\nScope-risk: narrow\nDirective: Keep ingest_many response semantics provisional until the Store contract is hardened.\nTested: cargo fmt --check -p agent-k-backend -p speedwagon; cargo check -p agent-k-backend; cargo test -p agent-k-backend --test document_test; cargo test -p speedwagon --no-default-features --lib; cargo clippy -p agent-k-backend --tests\nNot-tested: live ignored e2e RAG test requiring OPENAI_API_KEY * use aide::axum::routing * Keep document batch failures item-scoped Review feedback showed batch ingest and purge paths could hide item-level failures or over-clean existing artifacts. This keeps API ids string-shaped at the boundary while parsing per item before store operations. Constraint: PR #57 review requested String id consistency and explicit multipart/corpus failure handling Rejected: Converting speedwagon document ids to Uuid | would broaden index/tool/CLI scope beyond PR Confidence: high Scope-risk: narrow Directive: Keep speedwagon index/tool IDs string-shaped unless a broader migration is planned Tested: cargo fmt --check -p speedwagon -p agent-k-backend; cargo test -p speedwagon --lib; cargo test -p agent-k-backend --test document_test; cargo check -p speedwagon -p agent-k-backend; git diff --check Not-tested: clippy -D warnings; blocked by preexisting warnings outside this change --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: khj809 <onsealeatang@gmail.com> * update ailoy * feat: add Project workspace and session sharing model (#62) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: jhlee525 <bmrcreative90@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: nuri <yoonuri1@gmail.com> Co-authored-by: nuri-yoo 
<nuri-yoo@users.noreply.github.com> Co-authored-by: Park Woorak <wrpark@brekkylab.com> Co-authored-by: JaehunLee <ljhh0611@gmail.com> Co-authored-by: Copilot <copilot@github.com>
Ticket
resolves #45
Summary
Adds a `DescriptionAgent` to the speedwagon crate that turns a KB's `(title, purpose)` list into one ~200-character description. Pairs with `PurposeAgent` from #41: same Ailoy stack, same JSON-with-fallback parser. Library work only: no DB column, no `Manual | Auto` toggle, no HTTP endpoint. A backend that wants to refresh its descriptions calls `Store::describe` and writes the result back.
The motivating failure mode: chat-agent reads each Speedwagon's `description` in two places (`chat-agent/src/speedwagon/dispatch.rs:110-114` as the tool description, and `backend/src/prompt.rs:66` as the system prompt KB list). Today that field is plain user input that goes stale the moment documents change. An offline harness reproduced the silent-degradation case: with a four-KB / twelve-probe harness (three hand-written cross-domain questions per KB), a single empty description drops cross-domain routing accuracy from 12/12 (all probes routed to the right KB) to 11/12. All `X/12` numbers in this PR refer to this harness.
Changes
`description.rs` (new)
`DescriptionAgent` mirrors `parser::PurposeAgent`. Inputs: KB name, optional instruction, the full `(title, purpose)` list. Output: a single line, target ~200 chars.
`get_description` is the env-backed entry point, with the same shape as `parser::get_title` / `parser::get_purpose`: `dotenvy::dotenv().ok()`, build provider from `OPENAI_API_KEY`, run the agent, swap in `fallback_description` if the body comes back empty.
`fallback_description(N, top_titles)` writes `"{N} documents including: {top-5 titles}"`. The harness compared this against the empty-string fallback and `"{top-20 titles}"`; empty-string dropped to 11/12 (the no-description KB got bypassed when a query was ambiguous), and N+top-5 matched the longer fallback at a quarter of the tokens.
Prompt body
The system instruction is hardcoded. The shipped prompt is a self-anchored revision — its opening sentence does not mention "alongside descriptions of other knowledge bases", so the model is not primed to emit comparison vocabulary. Earlier candidates that did mention peers still routed 12/12 on the harness, but the self-anchored framing keeps the prompt aligned with the later "do not compare" clause and is the cleaner default.
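For reference, the `"{N} documents including: {top-5 titles}"` fallback described under `description.rs` can be sketched in a few lines. This is an illustration under assumptions, not the shipped function: the `", "` separator and the choice to take the first five entries of `top_titles` are guesses, since the PR text only gives the overall string shape.

```rust
// Sketch of the "{N} documents including: {top-5 titles}" fallback.
// Assumptions: `n` is the total document count, titles are joined
// with ", ", and at most the first five titles are shown.
fn fallback_description(n: usize, top_titles: &[&str]) -> String {
    let shown: Vec<&str> = top_titles.iter().copied().take(5).collect();
    format!("{} documents including: {}", n, shown.join(", "))
}
```

Capping the list at five titles bounds the fallback length regardless of KB size, in line with the harness result that N plus top-5 routed as well as the longer title list.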
`Store::describe` (`mod.rs`)
`Store::describe(kb_name, instruction)` pulls `(title, purpose)` from the index it already owns and forwards to `get_description`. Empty index returns `""` without an LLM call. The doc-comment notes the cost: one LLM call, input ≈ 24K chars at N=200 docs; callers should not invoke this synchronously inside indexing hot paths.
Prompt sweep
Real outputs at length ~200, four-KB / twelve-probe harness, shipped prompt:
Descriptions are forced to English by the prompt regardless of document language; kpaperqa's output is in English even though its corpus is Korean. Neither the shipped prompt nor the earlier peer-aware candidate references other KBs by name.
Other knobs from the same harness, fixed as single values in the code:
AgentCard.title + purpose vs purpose-only. Both 12/12; purpose-only saves ~25% of input tokens, so the prompt only renders `purpose`. The fallback string still uses titles since they are the cheapest identity signal when the LLM is unavailable.
The probe set itself:
The probe set is admittedly easy: four disjoint domains. The same exercise on near-domain KBs (e.g. `finance-2023` vs `finance-2024`) might land on different values; no such KBs exist yet.
Tests
12 new unit tests (parser variants, fallback shape, user-message rendering, empty-store short-circuit) plus one `#[ignore]`-gated integration test that prefills the index via `indexer::add_document` and runs `Store::describe` end-to-end.
The integration test passes locally in ~3s. Run it with:
Notes
`OPENAI_API_KEY`), same shape as `PurposeAgent` and `TitleAgent`. Unifying the three utility agents under one provider-routing abstraction belongs in a separate issue.
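To make the `Store::describe` control flow from the Changes section concrete, here is a rough sketch with the LLM call abstracted as a closure so it runs without network access. The function name `describe_with`, the closure signature, the trimming of the response, and the `", "` separator are all hypothetical; the real method is presumably async and reads from the index rather than taking a slice.

```rust
// Sketch of the describe flow: short-circuit on an empty index, call the
// LLM once, and fall back to a title-based string on an empty response.
// `describe_with` and the closure shape are hypothetical.
fn describe_with<F>(
    docs: &[(&str, &str)], // (title, purpose) pairs borrowed from the index
    kb_name: &str,
    instruction: Option<&str>,
    llm: F,
) -> String
where
    F: FnOnce(&str, Option<&str>, &[&str]) -> String,
{
    // Empty index: return "" without spending an LLM call.
    if docs.is_empty() {
        return String::new();
    }
    // The prompt renders only `purpose`; titles are kept for the fallback.
    let purposes: Vec<&str> = docs.iter().map(|(_, p)| *p).collect();
    let body = llm(kb_name, instruction, &purposes);
    let body = body.trim();
    if !body.is_empty() {
        return body.to_string();
    }
    // Empty LLM response: the title list is built only on this branch.
    let titles: Vec<&str> = docs.iter().map(|(t, _)| *t).take(5).collect();
    format!("{} documents including: {}", docs.len(), titles.join(", "))
}
```

Passing the model call in as a closure is only a testing convenience for this sketch; it lets the empty-index and empty-response branches be exercised deterministically.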